Abstract

Bugs in user input sanitation of software systems often lead to vulnerabilities, many of which
are caused by improper use of regular replacement. This paper presents a precise modeling
of various semantics of regular substitution, such as the declarative, finite, greedy, and reluctant,
using finite state transducers (FST). By projecting an FST to its input/output tapes, we are able 
to solve atomic string constraints, which can be applied to both the forward and backward image 
computation in model checking and symbolic execution of text processing programs. We report 
several interesting discoveries, e.g., certain fragments of the general problem can be handled using 
less expressive deterministic FST. A compact representation of FST is implemented in SUSHI, a 
string constraint solver. It is applied to detecting vulnerabilities in web applications. 


1 Introduction 

User input sanitation has been widely used by programmers to assure robustness and security of software. 
Regular replacement is one of the most frequently used approaches by programmers. For example, 
at both client and server sides of a web application, it is often used to perform format checking and 
filtering of command injection attack strings. As software bugs in user input sanitation can easily lead to 
vulnerabilities, it is desirable to employ automated analysis techniques for revealing such security holes. 
This paper presents the finite state transducer models of a variety of regular replacement operations, 
geared towards automated analysis of text processing programs. 

One application of the proposed technique is symbolic execution [10]. In [3] we outlined a unified
symbolic execution framework for discovering command injection vulnerabilities. The target system
under test is executed as usual except that program inputs are treated as symbolic literals. A path
condition is used to record the conditions to be met by the initial input, so that the program will execute
to a given location. At critical points, e.g., where a SQL query is submitted, path conditions are paired
with attack patterns. Solving these constraints leads to attack signatures.


1 <?php
2 $msg = $_POST["msg"];
3 $sanitized = preg_replace("/\<script.*?\>.*?\<\/script.*?\>/i", "", $msg);
4 save_to_db($sanitized);
5 ?>


Listing 1: Vulnerable Sanitation against XSS Attack 


In the following, we use an example to demonstrate the idea of the above research and motivate the
modeling of regular replacement in this paper. Consider the PHP snippet in Listing 1, which takes a
message as input and posts it to a bulletin. To prevent the Cross-Site Scripting (XSS) attack, the programmer
calls preg_replace() to remove any pair of <script> and </script> tags. Unfortunately, the
protection is insufficient. Readers can verify that <<script></script>script>alert('a')</script>
is an attack string. After preg_replace(), it yields <script>alert('a')</script>.
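The claim about Listing 1 can be checked mechanically. The following is a minimal sketch in Python rather than PHP, under the assumption that Python's `re.sub` with the same reluctant pattern behaves like `preg_replace` on these inputs; the names `SCRIPT_TAGS` and `sanitize` are hypothetical and not part of the original program.

```python
import re

# Same pattern as line 3 of Listing 1; "*?" is the reluctant quantifier
# and re.IGNORECASE plays the role of the "/i" modifier.
SCRIPT_TAGS = re.compile(r"<script.*?>.*?</script.*?>", re.IGNORECASE)

def sanitize(msg: str) -> str:
    """Remove each left-most, shortest <script>...</script> pair in one pass."""
    return SCRIPT_TAGS.sub("", msg)

benign = "hi <script>alert('a')</script> there"
attack = "<<script></script>script>alert('a')</script>"

print(repr(sanitize(benign)))   # the script element is removed
print(repr(sanitize(attack)))   # a script element survives the filter
```

The replacement is a single left-to-right pass, so the script element that is assembled from the leftover fragments is never re-examined.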

We now show how the attack signature is generated, assuming the availability of symbolic execution.
When the program is executed symbolically, variable $msg is initialized with a symbolic literal; let it be
\(x\). Assume \(\alpha\) is the regular expression <script.*?>.*?</script.*?> and \(\epsilon\) is the empty string. After
line 3, variable $sanitized has a symbolic value represented by the string expression \(x^{-}_{\alpha \to \epsilon}\), where
\({}^{-}_{\alpha \to \epsilon}\) is a replacement operator that denotes the effect of preg_replace, using the reluctant semantics (see “*?”
in the formula). Then at line 4, where the SQL query is submitted, a string constraint can be constructed as
below, using an existing attack pattern. The equation asks: can a JavaScript snippet be generated after
the preg_replace protection?

\(x^{-}_{\alpha \to \epsilon}\) = <script.*?>alert('a')</script.*?>

To solve the above equation, we first model the reluctant regular replacement \(x^{-}_{\alpha \to \epsilon}\) as a finite state
transducer (FST) and let it be \(\mathcal{A}_1\). The right hand side (RHS) of the equation is a regular expression, and
let it be \(r\). It is well known that the identity relation \(Id(r) = \{(w, w) \mid w \in L(r)\}\) is a regular relation that
can be recognized by an FST (let it be \(\mathcal{A}_2\)). Now let \(\mathcal{A}\) be the composition of \(\mathcal{A}_1\) and \(\mathcal{A}_2\) (by piping the
output tape of \(\mathcal{A}_1\) to the input tape of \(\mathcal{A}_2\)). Projecting \(\mathcal{A}\) to its input tape results in a finite state automaton
(FSA) that represents the solution of \(x\).

Notice that a precise modeling that distinguishes the various regular replacement semantics is
necessary. For example, a natural question following the above analysis is: if we approximate the reluctant
semantics using the greedy semantics, could the static analysis still be effective? The answer is negative:
when the *? operators in Listing 1 are treated as *, the analysis reports no solution for the equation, i.e.,
a false negative report on the actually vulnerable program.
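The difference between the two semantics is visible on the attack string itself. The sketch below is only an illustration on one concrete input, using Python's `re` module as a stand-in for the replacement operators; it does not show that the greedy equation has no solution, which is a statement about all inputs. The names `RELUCTANT`, `GREEDY`, and `TARGET` are hypothetical.

```python
import re

ATTACK = "<<script></script>script>alert('a')</script>"

# Pattern of Listing 1 (reluctant) and its greedy approximation.
RELUCTANT = re.compile(r"<script.*?>.*?</script.*?>", re.IGNORECASE)
GREEDY = re.compile(r"<script.*>.*</script.*>", re.IGNORECASE)

# Right-hand side of the string constraint (the attack pattern).
TARGET = re.compile(r"<script.*?>alert\('a'\)</script.*?>")

reluctant_out = RELUCTANT.sub("", ATTACK)
greedy_out = GREEDY.sub("", ATTACK)

print(repr(reluctant_out), bool(TARGET.fullmatch(reluctant_out)))
print(repr(greedy_out), bool(TARGET.fullmatch(greedy_out)))
```

Under the greedy reading, the single match swallows everything from the first `<script` to the last `>`, so this particular witness of the vulnerability disappears.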

In this paper, we present the modeling of regular replacement operations. §2 covers preliminaries. 
§3 and §4 present the modeling of various regular replacement semantics. §5 introduces tool support. §6 
discusses related work. §7 concludes. 

2 Preliminaries 

This section formalizes several typical semantics of regular substitution, and then introduces a variation
of the standard finite state transducer model. We introduce some notations first. Let \(\Sigma\) represent the
alphabet and \(R\) the set of regular expressions over \(\Sigma\). If \(\omega \in \Sigma^*\), \(\omega\) is called a word. Given a regular
expression \(r \in R\), its language is denoted as \(L(r)\). When \(\omega \in L(r)\) we say \(\omega\) is an instance of \(r\). We
sometimes abuse the notation as \(\omega \in r\) when the context is clear that \(r\) is a regular expression. A regular
expression \(r\) is said to be finite if \(L(r)\) is finite. Clearly, \(r \in R\) is finite if and only if there exists a
constant length bound \(n \in \mathbb{N}\) s.t. for any \(\omega \in L(r)\), \(|\omega| \le n\). We assume \(\# \notin \Sigma\) is the begin marker
and \(\$ \notin \Sigma\) is the end marker. They will be used in modeling procedural regular replacement in §4. Let
\(\Sigma_2 = \Sigma \cup \{\#, \$\}\). Assume \(\Psi\) is a second alphabet which is disjoint from \(\Sigma\). Given \(\omega \in (\Sigma \cup \Psi)^*\), \(\pi(\omega)\)
denotes the projection of \(\omega\) to \(\Sigma\) s.t. all the symbols in \(\Psi\) are removed from \(\omega\). For \(0 \le i < j \le |\omega|\), \(\omega[i,j]\)
represents the substring of \(\omega\) starting from index \(i\) and ending at \(j - 1\) (index counts from 0). Similarly, \(\omega[i]\)
refers to the element at index \(i\). We use NFST and DFST to denote the nondeterministic and deterministic
FST, respectively. NFSA and DFSA are defined similarly for finite state automata.

There are three popular semantics of regular replacement, namely greedy, reluctant, and possessive,
provided by many programming languages, e.g., in java.util.regex of J2SE. We concentrate on two
of them: the greedy and the reluctant. The greedy semantics tries to match a given regular expression
pattern with the longest substring of the input while the reluctant semantics works in the opposite way.
From the theoretical point of view, it is also interesting to define a declarative semantics for string
replacement. A declarative replacement \(\gamma_{r \to \omega}\) replaces every occurrence of a regular pattern \(r\) with \(\omega\).

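Definition 2.1. Let \(\gamma, \omega \in \Sigma^*\) and \(r \in R\) (with \(\epsilon \notin r\)). The declarative replacement, denoted as \(\gamma_{r \to \omega}\), is
defined as:

\[
\gamma_{r \to \omega} =
\begin{cases}
\{\gamma\} & \text{if } \gamma \notin \Sigma^* r \Sigma^* \\
\{\nu_{r \to \omega}\, \omega\, \mu_{r \to \omega} \mid \gamma = \nu\beta\mu \text{ and } \beta \in r\} & \text{otherwise}
\end{cases}
\]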
The greedy and reluctant semantics are also called procedural, because both of them enforce a left-most
matching. The replacement procedure is essentially a loop which examines each index of a word,
from left to right. Once there is a match of the regular pattern \(r\), the greedy replacement replaces the
longest match, and the reluctant replaces the shortest.

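Definition 2.2. Let \(\gamma, \omega \in \Sigma^*\) and \(r \in R\) (with \(\epsilon \notin r\)). The reluctant replacement of \(r\) with \(\omega\) in \(\gamma\), denoted
as \(\gamma^{-}_{r \to \omega}\), is defined recursively as \(\gamma^{-}_{r \to \omega} = \{\nu\omega\mu^{-}_{r \to \omega}\}\) where \(\gamma = \nu\beta\mu\), \(\nu \notin \Sigma^* r \Sigma^*\), \(\beta \in r\), and for every
\(\nu_1, \nu_2, \beta_1, \beta_2, \mu_1, \mu_2 \in \Sigma^*\) with \(\nu = \nu_1\nu_2\), \(\beta = \beta_1\beta_2\), \(\mu = \mu_1\mu_2\): if \(\nu_2 \ne \epsilon\) then \(\nu_2\beta_1 \notin r\) and \(\nu_2\beta\mu_1 \notin r\);
and, if \(\beta_2 \ne \epsilon\) then \(\beta_1 \notin r\).

Note that in the above definition, “if \(\nu_2 \ne \epsilon\) then \(\nu_2\beta_1 \notin r\) and \(\nu_2\beta\mu_1 \notin r\)” enforces left-most
matching, i.e., there does not exist an earlier match of \(r\) than \(\beta\); similarly, “if \(\beta_2 \ne \epsilon\) then \(\beta_1 \notin r\)” enforces
shortest matching, i.e., there does not exist a shorter match of \(r\) than \(\beta\).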

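Definition 2.3. Let \(\gamma, \omega \in \Sigma^*\) and \(r \in R\) (with \(\epsilon \notin r\)). The greedy replacement, denoted as \(\gamma^{+}_{r \to \omega}\), is defined
recursively as \(\gamma^{+}_{r \to \omega} = \{\nu\omega\mu^{+}_{r \to \omega}\}\) where \(\gamma = \nu\beta\mu\), \(\nu \notin \Sigma^* r \Sigma^*\), \(\beta \in r\), and for every \(\nu_1, \nu_2, \beta_1, \beta_2, \mu_1, \mu_2 \in
\Sigma^*\) with \(\nu = \nu_1\nu_2\), \(\beta = \beta_1\beta_2\), \(\mu = \mu_1\mu_2\): if \(\nu_2 \ne \epsilon\) then \(\nu_2\beta_1 \notin r\), and if \(\mu_1 \ne \epsilon\) then \(\nu_2\beta\mu_1 \notin r\).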

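Example 2.4. Let \(\gamma = aaa\) with \(a \in \Sigma\). (i) \(\gamma_{aa \to b} = \{ba, ab\}\), \(\gamma^{+}_{aa \to b} = \{ba\}\), and \(\gamma^{-}_{aa \to b} = \{ba\}\). (ii) \(\gamma_{a^+ \to b} =
\{b, bb, bbb\}\), \(\gamma^{+}_{a^+ \to b} = \{b\}\), and \(\gamma^{-}_{a^+ \to b} = \{bbb\}\).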
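The three semantics can be compared on the word aaa with a few lines of code. This is a minimal sketch, assuming \(\epsilon \notin r\) and assuming that Python's `re.sub` with greedy and reluctant quantifiers coincides with Definitions 2.3 and 2.2 on these simple patterns. The function `declarative` is a hypothetical helper that enumerates the set of Definition 2.1 by brute force.

```python
import re

def declarative(gamma: str, r: str, omega: str) -> set:
    """All results allowed by Definition 2.1 (assumes the empty word is not in r)."""
    pat = re.compile(r)
    results = set()
    n = len(gamma)
    for i in range(n):
        for j in range(i + 1, n + 1):
            if pat.fullmatch(gamma, i, j):      # beta = gamma[i:j] is an instance of r
                for left in declarative(gamma[:i], r, omega):
                    for right in declarative(gamma[j:], r, omega):
                        results.add(left + omega + right)
    return results or {gamma}                   # no occurrence of r: gamma is unchanged

decl_aa = declarative("aaa", "aa", "b")
decl_a_plus = declarative("aaa", "a+", "b")
greedy_a_plus = re.sub(r"a+", "b", "aaa")       # left-most, longest match
reluctant_a_plus = re.sub(r"a+?", "b", "aaa")   # left-most, shortest match

print(sorted(decl_aa), sorted(decl_a_plus), greedy_a_plus, reluctant_a_plus)
```

The declarative result is a set because any occurrence may be chosen, while each procedural semantics fixes exactly one result.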

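Notice that in the above definitions, \(\epsilon \notin r\) is required for simplicity. In practice, precise Perl/Java
regex semantics is followed for handling \(\epsilon \in r\). For example, in SUSHI, given \(\gamma = a\), \(r = a^*\), and \(\omega = b\),
\(\gamma^{-}_{r \to \omega} = \{bab\}\) and \(\gamma^{+}_{r \to \omega} = \{bb\}\). When \(\beta \in \gamma^{-}_{r \to \omega}\), we often abuse the notation and write it as \(\beta = \gamma^{-}_{r \to \omega}\),
given the following lemma. The same applies to \(\gamma^{+}_{r \to \omega}\).

Lemma 2.5. For any \(\gamma, \omega \in \Sigma^*\) and \(r \in R\): \(|\gamma^{+}_{r \to \omega}| = |\gamma^{-}_{r \to \omega}| = 1\).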

In the following, we briefly introduce the notion of FST and its variation, using the terminology in 
[7]. We demonstrate its application to modeling the declarative replacement. 

Definition 2.6. Let \(\Sigma^{\epsilon}\) denote \(\Sigma \cup \{\epsilon\}\). A finite state transducer (FST) is an enhanced two-taped
nondeterministic finite state machine described by a quintuple \((\Sigma, Q, q_0, F, \delta)\), where \(\Sigma\) is the alphabet, \(Q\) the
set of states, \(q_0 \in Q\) the initial state, \(F \subseteq Q\) the set of final states, and \(\delta\) is the transition function, which
is a total function of type \(Q \times \Sigma^{\epsilon} \times \Sigma^{\epsilon} \to 2^{Q}\).

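It is well known that each FST accepts a regular relation which is a subset of \(\Sigma^* \times \Sigma^*\). Given
\(\omega_1, \omega_2 \in \Sigma^*\) and an FST \(\mathcal{M}\), we say \((\omega_1, \omega_2) \in L(\mathcal{M})\) if the word pair is accepted by \(\mathcal{M}\). Let \(\mathcal{M}_3\) be the
composition of two FSTs \(\mathcal{M}_1\) and \(\mathcal{M}_2\), denoted as \(\mathcal{M}_3 = \mathcal{M}_1 \| \mathcal{M}_2\). Then \(L(\mathcal{M}_3) = \{(\mu, \nu) \mid (\mu, \eta) \in
L(\mathcal{M}_1) \text{ and } (\eta, \nu) \in L(\mathcal{M}_2) \text{ for some } \eta \in \Sigma^*\}\). We introduce an equivalent definition of FST below.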
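Composition as defined above can be sketched with a textbook product construction. The encoding below is an assumption made for illustration (it is not the SUSHI data structure): an FST is a start state, a set of final states, and a tuple of transitions `(src, inp, out, dst)`, where the empty string stands for \(\epsilon\). The names `FST`, `compose`, and `accepts` are hypothetical.

```python
from collections import deque
from typing import NamedTuple

class FST(NamedTuple):
    start: object
    finals: frozenset
    trans: tuple          # (src, inp, out, dst); "" stands for epsilon

def states(m: FST) -> set:
    return {m.start} | {t[0] for t in m.trans} | {t[3] for t in m.trans}

def compose(m1: FST, m2: FST) -> FST:
    """Pipe the output tape of m1 into the input tape of m2."""
    trans = []
    for (p, a, b, p2) in m1.trans:
        if b == "":       # m1 writes nothing: m2 stays where it is
            trans += [((p, q), a, "", (p2, q)) for q in states(m2)]
        else:             # m1 writes b: m2 must read the same b
            trans += [((p, q), a, c, (p2, q2))
                      for (q, b2, c, q2) in m2.trans if b2 == b]
    for (q, b, c, q2) in m2.trans:
        if b == "":       # m2 reads nothing: m1 stays where it is
            trans += [((p, q), "", c, (p, q2)) for p in states(m1)]
    finals = frozenset((f1, f2) for f1 in m1.finals for f2 in m2.finals)
    return FST((m1.start, m2.start), finals, tuple(trans))

def accepts(m: FST, x: str, y: str) -> bool:
    """Breadth-first search over (state, position in x, position in y)."""
    seen, queue = set(), deque([(m.start, 0, 0)])
    while queue:
        q, i, j = queue.popleft()
        if (q, i, j) in seen:
            continue
        seen.add((q, i, j))
        if q in m.finals and i == len(x) and j == len(y):
            return True
        for (s, a, b, d) in m.trans:
            if s == q and x.startswith(a, i) and y.startswith(b, j):
                queue.append((d, i + len(a), j + len(b)))
    return False

a_to_b = FST(0, frozenset({0}), ((0, "a", "b", 0), (0, "b", "b", 0)))
b_to_c = FST(0, frozenset({0}), ((0, "b", "c", 0),))
drop_b = FST(0, frozenset({0}), ((0, "b", "", 0),))

m3 = compose(a_to_b, b_to_c)
m4 = compose(a_to_b, drop_b)
print(accepts(m3, "ab", "cc"), accepts(m3, "ab", "bc"), accepts(m4, "ab", ""))
```

The product may contain redundant paths for \(\epsilon\) moves, which is harmless when only the accepted relation matters.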

Definition 2.7. An augmented finite state transducer (AFST) is an FST \((\Sigma, Q, q_0, F, \delta)\) with the transition
function augmented to type \(Q \times \mathcal{R} \to 2^{Q}\), where \(\mathcal{R}\) is the set of regular relations over \(\Sigma\).

In practice, we would often restrict the transition function of an AFST to the following two types: (1)
\(Q \times R \times \Sigma^* \to 2^{Q}\). In a transition diagram, we label the arc from \(q_i\) to \(q_j\) for transition \(q_j \in \delta(q_i, r : \omega)\)
by \(r : \omega\); and (2) \(Q \times \{Id(r) \mid r \in R\} \to 2^{Q}\), where \(Id(r) = \{(\omega, \omega) \mid \omega \in L(r)\}\). In a transition diagram,
an arc of type (2) is labeled as \(Id(r)\).

Figure 1: An FST for \(s_{r \to \omega}\)

Now, we can use an AFST to model the declarative string replacement \(s_{r \to \omega}\) for any \(\omega \in \Sigma^*\) and
\(r \in R\) (with \(\epsilon \notin r\)). Figure 1 shows the construction, which presents an AFST that accepts \(\{(s, \eta) \mid s \in
\Sigma^* \text{ and } \eta \in s_{r \to \omega}\}\). In other words, given any two \(s, \eta \in \Sigma^*\), we can use the AFST to check if \(\eta\) is a
string obtained from \(s\) by replacing every occurrence of patterns in \(r\) with \(\omega\). We use FST
and AFST interchangeably for the time being without loss of generality.

3 DFST and Finite Replacement 

This section shows that regular replacement with a finite language pattern can be modeled using DFST,
under certain restrictions. We fix the notation of DFST first. Intuitively, for a DFST, at any state \(q \in Q\)
the input symbol uniquely determines the destination state and the symbol on the output tape. If there is a
transition labeled with \(\epsilon\) on the input, then this is the only transition from \(q\).

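Definition 3.1. An FST \(\mathcal{A} = (\Sigma, Q, s_0, F, \delta)\) is deterministic if for any \(q \in Q\) and any \(a \in \Sigma\) the following
is true. Let \(t_1, t_2 \in \{a, \epsilon\}\), \(b_1, b_2 \in \Sigma^{\epsilon}\), and \(q_1, q_2 \in Q\). Then \(q_1 = q_2\), \(b_1 = b_2\), and \(t_1 = t_2\) if \(q_1 \in \delta(q, t_1 : b_1)\)
and \(q_2 \in \delta(q, t_2 : b_2)\).

Lemma 3.2. Let \(\$ \notin \Sigma\) be an end marker. Given a finite regular expression \(r \in R\) with \(\epsilon \notin r\) and \(\omega_2 \in \Sigma^*\),
there exist DFST \(\mathcal{A}^{-}\) and \(\mathcal{A}^{+}\) s.t. for any \(\omega, \omega_1 \in \Sigma^*\): \(\omega_1 = \omega^{-}_{r \to \omega_2}\) iff \((\omega\$, \omega_1\$) \in L(\mathcal{A}^{-})\); and,
\(\omega_1 = \omega^{+}_{r \to \omega_2}\) iff \((\omega\$, \omega_1\$) \in L(\mathcal{A}^{+})\).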

We briefly describe how \(\mathcal{A}^{+}\) is constructed for \(\omega^{+}_{r \to \omega_2}\); the construction of \(\mathcal{A}^{-}\) is similar. Given a finite regular
expression \(r\), assume its length bound is \(n\). Let \(\Sigma^{\le n} = \bigcup_{0 \le i \le n} \Sigma^{i}\). Then \(\mathcal{A}^{+}\) is defined as a quintuple
\((\Sigma \cup \{\$\}, Q, q_0, F, \delta)\). The set of states \(Q = \{q_1, \ldots, q_{|\Sigma^{\le n}|}\}\) has \(|\Sigma^{\le n}|\) elements, and let \(\mathcal{B} : \Sigma^{\le n} \to Q\) be
a bijection. Let \(q_0 = \mathcal{B}(\epsilon)\) be the initial state and the only final state. A transition \((q, q', a : b)\) is defined
as follows for any \(q \in Q\) and \(a \in \Sigma \cup \{\$\}\), letting \(\beta = \mathcal{B}^{-1}(q)\): (case 1) if \(a \ne \$\) and \(|\beta| < n\), then \(b = \epsilon\)
and \(q' = \mathcal{B}(\beta a)\); or (case 2) if \(a \ne \$\) and \(|\beta| = n\): if \(\beta \notin r\Sigma^*\), then \(b = \beta[0]\) and \(q' = \mathcal{B}(\beta[1, |\beta|]a)\);
otherwise, let \(\beta = \mu\nu\) where \(\mu\) is the longest match of \(r\), then \(b = \omega_2\) and \(q' = \mathcal{B}(\nu a)\); or (case 3) if
\(a = \$\), then \(b = \beta^{+}_{r \to \omega_2}\$\) and \(q' = q_0\).

Intuitively, the above algorithm simulates the left-most matching. It buffers the current string processed
so far, and the buffer size is the length bound of \(r\). Once the buffer is full (case 2), it examines
the buffer and checks if there is a match. If not, it removes the first character from the buffer and produces it as output;
otherwise, it produces \(\omega_2\) on the output tape. Clearly, \(\mathcal{B}\) is feasible because of the bounded length of \(r\).
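The three cases above can be simulated directly, with the buffer playing the role of the DFST state. The following is a minimal sketch under stated assumptions: the finite pattern is given as a Python set of words (rather than as a regular expression), `$` does not occur in the input word, and the names `longest_prefix_match` and `dfst_greedy_replace` are hypothetical.

```python
def longest_prefix_match(buf: str, pattern: set) -> int:
    """Length of the longest prefix of buf that is an instance of r (0 if none)."""
    for k in range(len(buf), 0, -1):
        if buf[:k] in pattern:
            return k
    return 0

def dfst_greedy_replace(word: str, pattern: set, omega: str) -> str:
    n = max(len(p) for p in pattern)      # length bound of the finite pattern
    buf, out = "", []                     # buf is the current state (a word of length <= n)
    for a in word + "$":
        if a != "$" and len(buf) < n:     # case 1: keep buffering, emit nothing
            buf += a
        elif a != "$":                    # case 2: buffer is full
            k = longest_prefix_match(buf, pattern)
            if k == 0:
                out.append(buf[0])        # no match at the front: emit one symbol
                buf = buf[1:] + a
            else:
                out.append(omega)         # replace the longest match at the front
                buf = buf[k:] + a
        else:                             # case 3: end marker, flush the buffer
            while buf:
                k = longest_prefix_match(buf, pattern)
                out.append(omega if k else buf[0])
                buf = buf[k:] if k else buf[1:]
            out.append("$")
    return "".join(out)

print(dfst_greedy_replace("zabcabq", {"ab", "abc"}, "X"))
print(dfst_greedy_replace("aaa", {"aa"}, "b"))
```

Because every instance of the pattern has length at most n, a full buffer always contains the complete longest match that starts at its first symbol, which is why no look-ahead beyond the buffer is needed.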

4 Procedural Replacement 

The modeling of procedural replacement is much more complex than that of the declarative semantics. 
The general idea is to compose a number of finite state transducers for generating and filtering begin 
and end markers for the regular pattern in the input word. We start with the reluctant semantics. Given 
reluctant replacement \(S^{-}_{r \to \omega}\), the modeling consists of four steps.

4.1 Modeling Left-Most Reluctant Replacement 

Step 1 (DFST Marker for End of Regular Pattern): The objective of this step is to construct a DFST
(called \(\mathcal{A}_1\)) that marks the end of regular pattern \(r\), given \(S^{-}_{r \to \omega}\). We first construct a deterministic FSA
\(\mathcal{A}\) that accepts \(r\) s.t. \(\mathcal{A}\) does not have any \(\epsilon\) transition. We use \((q, a, q')\) to denote a transition from
state \(q\) to \(q'\) that is labeled with \(a \in \Sigma\). Then we modify each final state \(f\) of the FSA as below: (1) make
\(f\) a non-final state, (2) create a new final state \(f'\) and establish an \(\epsilon\) transition from \(f\) to \(f'\), (3) for any
outgoing transition \((f, a, s)\) create a new transition \((f', a, s)\) and remove that outgoing transition from \(f\)
(keeping the \(\epsilon\) transition). Thus the \(\epsilon\) transition is the only outgoing transition of \(f\). Then convert the
FSA into a DFST \(\mathcal{A}_1\) as below: for an \(\epsilon\) transition, its output is \(\$\) (end marker); for every other transition,
its output is the same as the input symbol.

Figure 2: DFST End Marker

Example 4.1. \(A_1\) in Figure 2 is the DFST generated for regular expression \(cb^+a^+\).

Step 2 (Generic End Marker): Note that on the input tape, \(\mathcal{A}_1\) only accepts \(r\). We would like to
generalize \(\mathcal{A}_1\) so that the new FST (called \(\mathcal{A}_2\)) will accept any word on its input tape. For example, \(A_2\)
in Figure 2 is a generalization of \(A_1\), and \((ccbbaa, ccbba\$a\$) \in L(A_2)\).

Step 2 is formally defined as follows. Given \(\mathcal{A}_1 = (\Sigma \cup \{\$\}, Q_1, q_0^1, F_1, \delta_1)\) as described in Step 1, \(\mathcal{A}_2\)
is a quintuple \((\Sigma \cup \{\$\}, Q_2, q_0^2, F_2, \delta_2)\). A labeling function \(\mathcal{B} : Q_2 \to 2^{Q_1}\) is a bijection s.t. \(\mathcal{B}(q_0^2) = \{q_0^1\}\).
For any \(t \in Q_2\) and \(a \in \Sigma\): \(t' \in \delta_2(t, a : a)\) iff \(\mathcal{B}(t') = \{s' \mid \exists s \in \mathcal{B}(t) \text{ s.t. } s' \in \delta_1(s, a : a)\} \cup \{q_0^1\}\). Clearly,
\(\mathcal{B}\) models the collection of states in \(\mathcal{A}_1\) that can be reached by the substrings consumed so far on \(\mathcal{A}_2\). Note
that there is at most one state reached by a substring, because \(\mathcal{A}_1\) is deterministic. Hence, the collection
of states is always finite. The handling of the only \(\epsilon\) transition in \(\mathcal{A}_2\) is similar.

Example 4.2. \(A_2\) in Figure 2 is the result of applying the above algorithm on \(A_1\). Clearly, for \(A_2\),
\(\mathcal{B}(1) = \{1\}\), \(\mathcal{B}(2) = \{2, 1\}\), and \(\mathcal{B}(3) = \{3, 1\}\). Running \((ccbb, ccbb)\) on \(A_2\) results in a partial run to
state 3. For \((ccbb, ccbb)\), there are five substring pairs to be observed: \((ccbb, ccbb)\), \((cbb, cbb)\), \((bb, bb)\),
\((b, b)\), and \((\epsilon, \epsilon)\). Among them, only \((cbb, cbb)\) and \((\epsilon, \epsilon)\) can be extended to match \(r\) (i.e., \(cb^+a^+\)).
Clearly, if we run them on \(A_1\), they would result in partial runs that end at states 3 (by \((cbb, cbb)\)) and 1 (by
\((\epsilon, \epsilon)\)). This is the intuition of having \(\mathcal{B}(3) = \{3, 1\}\) in \(A_2\). The labeling function \(\mathcal{B}\) keeps track of the
potential substrings of match by recording those states of \(A_1\) that could be reached by the substrings.
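The role of the labeling function can be seen by tracking the set of reachable states of the pattern automaton while reading an arbitrary word. The sketch below hard-codes a DFA for \(cb^+a^+\) as an assumption (states 1 to 4, with the final state and its \(\epsilon\) successor of Figure 2 merged into state 4); `DELTA`, `step`, `label_after`, and `mark_ends` are hypothetical names.

```python
# Hypothetical encoding of a DFA for cb+a+ (state 1 initial, state 4 final).
DELTA = {(1, "c"): 2, (2, "b"): 3, (3, "b"): 3, (3, "a"): 4, (4, "a"): 4}
START, FINALS = 1, {4}

def step(current: set, a: str) -> set:
    """Advance every tracked run by a, and start a fresh run at the next index."""
    return {DELTA[(s, a)] for s in current if (s, a) in DELTA} | {START}

def label_after(word: str) -> set:
    """The set of pattern states tracked after reading word (the label B(t))."""
    current = {START}
    for a in word:
        current = step(current, a)
    return current

def mark_ends(word: str) -> str:
    """Copy word and insert '$' after every prefix that ends with a match of r."""
    current, out = {START}, []
    for a in word:
        current = step(current, a)
        out.append(a)
        if current & FINALS:
            out.append("$")
    return "".join(out)

print(mark_ends("ccbbaa"), sorted(label_after("ccbb")))
```

Each tracked state corresponds to one suffix of the consumed input that can still be extended to a match, which is the set recorded by the labeling function in Example 4.2.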

The following lemma states that \(\mathcal{A}_2\) inserts an end marker \(\$\) after each occurrence of regular pattern
\(r\), and there are no duplicate end markers inserted (even when the empty string \(\epsilon \in r\)).

Lemma 4.3. For any \(r \in R\) there exists a DFST \(\mathcal{A}_2\) s.t. for any \(\omega \in \Sigma^*\), there is one and only one
\(\omega_2 \in (\Sigma \cup \{\$\})^*\) with \((\omega, \omega_2) \in L(\mathcal{A}_2)\) and \(\omega = \pi(\omega_2)\) such that \(\omega_2\) satisfies the following: for any
\(0 \le x < |\omega_2|\), \(\omega_2[x] = \$\) iff \(\pi(\omega_2[0, x]) \in \Sigma^* r\); and for any \(1 \le x < |\omega_2|\), if \(\omega_2[x] = \$\) then \(\omega_2[x-1] \ne \$\).

Step 3 (Begin Marker of Regular Pattern): From \(\mathcal{A}_2\) we can construct a reverse transducer \(\mathcal{A}_3\) by
reversing all transitions in \(\mathcal{A}_2\) and replacing the end marker \(\$\) with the begin marker \(\#\). Then create a
new initial state \(s_0\), add \(\epsilon\) transitions from \(s_0\) to each final state in \(\mathcal{A}_2\), and make the original initial state
of \(\mathcal{A}_2\) the final state in \(\mathcal{A}_3\). For example, the \(A_3\) shown in Figure 3 is a reverse of \(A_2\) in Figure 2. Clearly,
\((aabbcc, \#a\#abbcc) \in L(A_3)\), and \(A_3\) marks the beginning for pattern \(r = a^+b^+c\).

Lemma 4.4. For any \(r \in R\) there exists an FST \(\mathcal{A}_3\) s.t. for any \(\mu \in \Sigma^*\), there exists one and only one
\(\nu \in (\Sigma \cup \{\#\})^*\) with \((\mu, \nu) \in L(\mathcal{A}_3)\). \(\nu\) satisfies the following: (i) \(\mu = \pi(\nu)\), and, (ii) for \(0 \le i < |\nu|\):
\(\nu[i] = \#\) iff \(\pi(\nu[i, |\nu|]) \in r\Sigma^*\), and (iii) for \(1 \le i < |\nu|\): if \(\nu[i] = \#\) then \(\nu[i-1] \ne \#\).

Figure 3: Begin Marker and Reluctant Replacement Transducers


The beauty of the nondeterminism is that \(\mathcal{A}_3\) can always make the “smart” decision to ensure that there
is one and only one run which “correctly” inserts the label #. Any incorrect insertion will never reach a
final state. The nondeterminism gives \(\mathcal{A}_3\) the “look ahead” ability.
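The marking property of the begin marker (a # is placed exactly in front of every index where some match of the pattern starts) can be checked with a short sketch. It does not build the reversed transducer; instead it assumes that Python's `Pattern.match` can serve as the look-ahead oracle, and it assumes that the empty word is not an instance of the pattern. The name `mark_begins` is hypothetical.

```python
import re

def mark_begins(word: str, r: str) -> str:
    """Insert '#' before every index at which an instance of r starts."""
    pat = re.compile(r)
    out = []
    for i, ch in enumerate(word):
        if pat.match(word, i):      # some prefix of word[i:] is an instance of r
            out.append("#")
        out.append(ch)
    return "".join(out)

print(mark_begins("aabbcc", "a+b+c"))
print(mark_begins("aaa", "a+"))
```

The look-ahead that the transducer obtains from nondeterminism is obtained here by simply scanning forward from each index.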

Step 4 (Reluctant Replacement): Next we define an automaton for implementing the reluctant replacement
semantics. Given a DFSA \(\mathcal{M}\), let \(\mathcal{M}_1\) be the new automaton generated from \(\mathcal{M}\) by removing all
the outgoing transitions from each final state of \(\mathcal{M}\). We have the following result: \(L(\mathcal{M}_1) = \{s \mid s \in
L(\mathcal{M}) \wedge \forall s' \prec s : s' \notin L(\mathcal{M})\}\), where \(s' \prec s\) denotes that \(s'\) is a proper prefix of \(s\). Clearly \(\mathcal{M}_1\) implements the “shortest match” semantics. Given \(r \in R\),
let reluc(\(r\)) represent the result of applying the above “reluctant” transformation on \(r\).
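The “reluctant” transformation is a one-line operation on a transition table. The sketch below assumes a DFA encoded as a dictionary from (state, symbol) to state, here for the pattern \(cb^+a^+\); the names `DELTA`, `reluc`, and `accepts` are hypothetical.

```python
# Hypothetical DFA for cb+a+ (state 1 initial, state 4 final).
DELTA = {(1, "c"): 2, (2, "b"): 3, (3, "b"): 3, (3, "a"): 4, (4, "a"): 4}
START, FINALS = 1, {4}

def reluc(delta: dict, finals: set) -> dict:
    """Drop every transition that leaves a final state."""
    return {(s, a): t for (s, a), t in delta.items() if s not in finals}

def accepts(delta: dict, start: int, finals: set, word: str) -> bool:
    state = start
    for a in word:
        if (state, a) not in delta:
            return False
        state = delta[(state, a)]
    return state in finals

SHORTEST = reluc(DELTA, FINALS)
print(accepts(DELTA, START, FINALS, "cbaa"), accepts(SHORTEST, START, FINALS, "cbaa"))
print(accepts(SHORTEST, START, FINALS, "cbba"))
```

After the transformation a run stops at the first final state it reaches, so only words without an accepted proper prefix remain in the language.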

We still need to filter the extra begin markers during the replacement process. Given a regular
language \(\mathcal{L} = \text{reluc}(r)\), let \(\mathcal{L}_{\#}\) represent the language generated from \(\mathcal{L}\) by nondeterministically inserting
#, i.e., \(\mathcal{L}_{\#} = \{\mu \mid \mu \in (\Sigma \cup \{\#\})^* \wedge \pi(\mu) \in \mathcal{L}\}\). Clearly, to recognize \(\mathcal{L}_{\#}\), an automaton can be
constructed from \(\mathcal{A}\) (which accepts \(\mathcal{L}\)) by attaching a self loop transition (labeled with #) to each state of
\(\mathcal{A}\). Let \(\mathcal{L}'_{\#} = \mathcal{L}_{\#} \cap \overline{\Sigma_2^* \#}\) (this is to avoid removing the begin marker for the next match). Now given the
regular language \(\mathcal{L}'_{\#}\) and \(\omega \in \Sigma^*\), it is straightforward to construct an FST \(\mathcal{A}_{\mathcal{L}'_{\#} \times \omega}\) s.t. \((\mu, \nu) \in L(\mathcal{A}_{\mathcal{L}'_{\#} \times \omega})\)
iff \(\mu \in \mathcal{L}'_{\#}\) and \(\nu = \omega\). Intuitively, given any \(\mu\) (interspersed with #) that matches \(r\), the FST replaces it
with \(\omega\).

An automaton \(\mathcal{A}_4\) (as shown in Figure 3) can be defined. Intuitively, \(\mathcal{A}_4\) consumes a symbol on both
the input tape and output tape unless encountering a begin marker \(\#\). Once a \(\#\) is consumed, \(\mathcal{A}_4\) enters
the replacement mode, which replaces the shortest match of \(r\) with \(\omega\) (and also removes extra \(\#\) in the
match). Thus, piping it with \(\mathcal{A}_3\) directly leads to the precise modeling of reluctant replacement.
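The two-stage pipeline (begin marking, then replacement mode) can be sketched as two passes over the word. This is a simulation under stated assumptions, not the transducer construction itself: the pattern does not accept the empty word, `#` does not occur in the input, and Python's `re` module is used as the membership oracle for the pattern. The names `mark_begins` and `reluctant_replace` are hypothetical.

```python
import re

def mark_begins(word: str, pat) -> str:
    """Stage 1: insert '#' before every index at which a match of r starts."""
    return "".join(("#" if pat.match(word, i) else "") + ch
                   for i, ch in enumerate(word))

def reluctant_replace(word: str, r: str, omega: str) -> str:
    """Stage 2: copy symbols; on '#', replace the shortest match by omega."""
    pat = re.compile(r)
    marked = mark_begins(word, pat)
    out, i = [], 0
    while i < len(marked):
        if marked[i] != "#":
            out.append(marked[i])              # normal mode: copy the symbol
            i += 1
            continue
        i += 1                                 # replacement mode starts after '#'
        matched = ""
        while not pat.fullmatch(matched):      # stop at the shortest match
            if marked[i] != "#":               # extra '#' inside the match is dropped
                matched += marked[i]
            i += 1
        out.append(omega)
    return "".join(out)

attack = "<<script></script>script>alert('a')</script>"
print(reluctant_replace("aaa", "a+", "b"))
print(reluctant_replace("aabbcc", "a+b+c", "X"))
print(reluctant_replace(attack, r"<script.*>.*</script.*>", ""))
```

The inner loop stops as soon as the consumed symbols form an instance of the pattern, so a begin marker that immediately follows the match is left in place for the next replacement.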

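Lemma 4.5. Given any \(r \in R\) and \(\omega \in \Sigma^*\), let \(\mathcal{A}_r\) be \(\mathcal{A}_3 \| \mathcal{A}_4\), then for any \(\omega_1, \omega_2 \in \Sigma^*\): \((\omega_1, \omega_2) \in
L(\mathcal{A}_r)\) iff \(\omega_2 = \omega_{1\,r \to \omega}^{-}\).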

4.2 Modeling Left-Most Greedy Semantics 

Handling the greedy semantics is more complex. We have to insert both begin and end markers for the
regular pattern and then apply a number of filters to ensure the longest match. The first action is to
insert begin markers using \(\mathcal{A}_3\) as described in the previous section. Then the second action is to insert
an end marker \(\$\) nondeterministically after each substring matching \(r\). Later, additional markers will be
filtered, and improper marking will be rejected. We call this FST \(\mathcal{A}_2'\). Given \(r \in R\) and \(\omega \in \Sigma^*\), \(\mathcal{A}_2'\)
can be constructed so that for any \(\omega_1, \omega_2 \in \Sigma_2^*\): \((\omega_1, \omega_2) \in L(\mathcal{A}_2')\) iff (i) \(\pi(\omega_1) = \pi(\omega_2)\), and
(ii) for any \(0 \le i < |\omega_2|\), \(\pi(\omega_2[0, i]) \in \Sigma^* r\) if \(\omega_2[i] = \$\), and (iii) for any \(1 \le i < |\omega_2|\), if \(\omega_2[i] = \$\) then
\(\omega_2[i-1] \ne \$\). Notice that \(\mathcal{A}_2'\) is different from \(\mathcal{A}_2\) in that the \(\$\) after a match of \(r\) is optional. Clearly,
\(\mathcal{A}_2'\) can be modified from \(\mathcal{A}_2\) by simply adding an \(\epsilon : \epsilon\) transition from \(f\) (old final state) to \(f'\) (new final
state) in \(\mathcal{A}_2\), e.g., by adding an \(\epsilon : \epsilon\) transition from state 4 to 5 in \(A_2\) in Figure 2. Also \(\# : \#\) transitions are
needed for each state to keep the \(\#\) introduced by \(\mathcal{A}_3\).

Then we need a filter to remove extra markers so that every \(\$\) is paired with a \(\#\). Note we do not
yet make sure that between the pair of \(\#\) and \(\$\), the substring is a match of \(r\). We construct the AFST \(\mathcal{A}_f\)
as follows. Let \(\mathcal{A}_f = (\Sigma_2, Q, q_0, F, \delta)\). \(Q\) has two states \(q_0\) and \(q_1\). \(F = \{q_0\}\). The transition function
\(\delta\) is defined as below: (i) \(\delta(q_0, Id(\Sigma)) = \{q_0\}\), (ii) \(\delta(q_0, \$ : \epsilon) = \{q_0\}\), (iii) \(\delta(q_0, \# : \#) = \{q_1\}\), (iv)
\(\delta(q_1, \# : \epsilon) = \{q_1\}\), (v) \(\delta(q_1, Id(\Sigma)) = \{q_1\}\), (vi) \(\delta(q_1, \$ : \$) = \{q_0\}\).
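The six transitions of the pairing filter can be simulated one symbol at a time. The sketch below follows rules (i) to (vi), reading each Id(Σ) step as consuming a single symbol; the name `pair_markers` is hypothetical, and `None` is used here to signal that the run does not end in the final state.

```python
from typing import Optional

def pair_markers(word: str) -> Optional[str]:
    """Keep only paired '#'/'$' markers; return None if the run is rejected."""
    state, out = 0, []
    for ch in word:
        if state == 0:
            if ch == "#":            # (iii) keep '#', a pair is now open
                out.append("#")
                state = 1
            elif ch != "$":          # (i) copy an ordinary symbol
                out.append(ch)
            # (ii) a '$' with no open '#' is dropped
        else:
            if ch == "$":            # (vi) keep '$', the pair is closed
                out.append("$")
                state = 0
            elif ch != "#":          # (v) copy an ordinary symbol
                out.append(ch)
            # (iv) a second '#' inside an open pair is dropped
    return "".join(out) if state == 0 else None

print(pair_markers("#$b#a$#$b#$"))
print(pair_markers("#a#ab$b$"))
print(pair_markers("a#b"))
```

A word that ends with an open # is rejected because the only final state is the one in which no pair is pending.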

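Now we will apply three FST filters (represented by three identity relations \(Id(L_1)\), \(Id(L_2)\), and
\(Id(L_3)\)), for filtering the nondeterministic end marking. \(L_1\), \(L_2\), and \(L_3\) are defined as below:

\[ L_1 = \overline{\Sigma_2^* \# (\overline{r} \cap \Sigma^*) \$ \Sigma_2^*} \qquad (1) \]

\[ L_2 = \overline{\Sigma_2^* [^{\wedge}\#] (\$ \cap r_{\#,\$})} \cap \overline{\Sigma_2^* [^{\wedge}\#] (\$\Sigma\Sigma_2^* \cap r_{\#,\$}\Sigma_2^*)} \qquad (2) \]

\[ L_3 = \overline{\Sigma_2^* \# (r_{\#,\$} \cap (\Sigma^* \$ (\Sigma^+)_{\#,\$})) \Sigma_2^*} \qquad (3) \]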

The intuition of \(L_1\) is to make sure that the substring between each pair of \(\#\) and \(\$\) is a match of \(r\).

The motivation of \(L_2\) is to prevent \(\mathcal{A}_f\) from removing too many \(\#\) symbols (due to improper insertion
of end markers by \(\mathcal{A}_2'\)). \(Id(L_2)\) handles two cases: (1) to avoid removing the begin markers at the end of the
input word if the pattern \(r\) includes \(\epsilon\); and (2) to avoid removing begin markers for the next instance of \(r\).
Consider the following example for case (2): given \(S^{+}_{a^* \to c}\) and the input word bab, the correct marking of
begin and end markers should be #$b#a$#$b#$ (which leads to cbccbc as output). However, the following
incorrect marking could pass \(Id(L_1)\) and \(Id(L_3)\) if the \(Id(L_2)\) filter is not enforced: #$b#a$b#$. The cause
is that an end marker \(\$\) (e.g., the one before the last b) may trigger \(\mathcal{A}_f\) to remove a good begin marker
\(\#\) that precedes an instance of \(r\) (i.e., \(\epsilon\)). Filter \(Id(L_2)\) is thus defined for preventing such cases.
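The effect of the two markings in this example can be reproduced with a short sketch. It only implements the last stage (replacing each marked block) and a check in the spirit of \(L_1\) (the content of every block is an instance of the pattern); it does not implement \(Id(L_2)\) or \(Id(L_3)\). The names `replace_marked` and `blocks_match` are hypothetical.

```python
import re

def replace_marked(marked: str, omega: str) -> str:
    """Replace every '#...$' block (markers included) with omega."""
    return re.sub(r"#[^#$]*\$", omega, marked)

def blocks_match(marked: str, r: str) -> bool:
    """Check that the content of every '#...$' block is an instance of r."""
    return all(re.fullmatch(r, block) is not None
               for block in re.findall(r"#([^#$]*)\$", marked))

correct = "#$b#a$#$b#$"      # marking of bab given in the text
incorrect = "#$b#a$b#$"      # a begin marker for the empty match was lost

for marking in (correct, incorrect):
    print(marking, blocks_match(marking, "a*"), replace_marked(marking, "c"))
```

Both markings contain only well-formed blocks, yet the second one silently skips the empty match after a, which is the situation the additional filter is meant to exclude.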

Finally, \(L_3\) is defined for ensuring longest match. Note that filter \(Id(L_3)\) will be applied after \(Id(L_1)\)
and \(Id(L_2)\), which have guaranteed the pairing of markers and the proper contents between each pair of
markers. \(L_3\) eliminates cases where, starting from \(\#\), there is a substring that (when projected to \(\Sigma\)) matches
\(r\) and the string contains at least one \(\$\) inside (implying that there is a longer match than the substring
between the \(\#\) and its matching \(\$\)). Note that \((\Sigma^+)_{\#,\$}\) refers to a word in \(\Sigma^+\) interspersed with begin/end
markers, i.e., for any \(\omega \in (\Sigma^+)_{\#,\$}\), \(|\pi(\omega)| > 0\). We also need an FST which is very similar to
\(\mathcal{A}_4\): it enters (and leaves) the replacement mode once it sees the begin (and the end) marker. Then we have
the following:

Lemma 4.6. Given any \(r \in R\) and \(\omega \in \Sigma^*\), let \(\mathcal{A}_g\) be the composition of the transducers described above, then for
any \(\omega_1, \omega_2 \in \Sigma^*\): \(\omega_2 = \omega_{1\,r \to \omega}^{+}\) iff \((\omega_1, \omega_2) \in L(\mathcal{A}_g)\).

5 SISE Constraint and SUSHI Solver 

This work is implemented as part of a constraint solver called SUSHI [4], which solves SISE (Simple
Linear String Equation) constraints. Intuitively, a SISE equation can be regarded as a variation of word
equation [13]. It is composed of word literals, string variables, and various frequently seen string
operators such as substring, concatenation, and regular replacement. To solve SISE, an automata based
approach is taken, where a SISE is broken down into a number of atomic string operations. Then the
solution process consists of a number of backward image computation steps. We now briefly describe
the part related to regular replacement.

It is well known that projecting an FST to its input tape (by removing the output symbol from each
transition) results in a standard finite state machine. The same applies to the projection to the output tape.
We use input(\(\mathcal{A}\)) and output(\(\mathcal{A}\)) to denote the input and output projection of an FST \(\mathcal{A}\). Given
an atomic SISE constraint \(x_{r \to \omega} = r_2\), the solution pool of \(x\) (backward image of the constraint) is
defined as \(\{\mu \mid \mu_{r \to \omega} \in L(r_2)\}\). Given a regular expression \(\nu\), the forward image of \(\nu_{r \to \omega}\) is defined as
\(\{\mu \mid \mu \in \alpha_{r \to \omega} \text{ and } \alpha \in \nu\}\). Clearly, let \(\mathcal{A}\) be the corresponding FST of \(x_{r \to \omega}\), the backward image can
be computed using input(\(\mathcal{A} \| Id(r_2)\)). Similarly, given \(\nu\), the forward image is output(\(Id(\nu) \| \mathcal{A}\)).
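The definition of the solution pool can be made concrete on a tiny instance by brute force. The sketch below is not the automata-based procedure: it enumerates all candidate words up to a length bound and keeps those whose image under replacement matches the right-hand side. It assumes that Python's `re.sub` stands in for the procedural replacement, and the name `backward_image` and its parameters are hypothetical.

```python
import re
from itertools import product

def backward_image(r: str, omega: str, r2: str, alphabet: str, max_len: int) -> list:
    """All x over alphabet with |x| <= max_len such that replace(x) is in L(r2)."""
    solutions = []
    for n in range(max_len + 1):
        for letters in product(alphabet, repeat=n):
            x = "".join(letters)
            if re.fullmatch(r2, re.sub(r, omega, x)):
                solutions.append(x)
    return solutions

# Which inputs over {a, b} still yield "ab" after every "ab" is deleted?
pool = backward_image(r="ab", omega="", r2="ab", alphabet="ab", max_len=4)
print(pool)
```

The single solution found has the same shape as the XSS attack string: the forbidden word is rebuilt from the fragments left behind by the deletion.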
Figure 4: SUSHI FST Transition Set

5.1 Compact Representation of FST 


SUSHI relies on dk.brics.automaton [15] for FSA operations. We use a self-made Java package for
supporting FST operations [16]. Note that there are existing tools related to FST, e.g., the FSA toolbox
[16]. In practice, to perform inspection on user input, FST has to handle a large alphabet represented
using 16-bit Unicode. In the following, we introduce a compact representation of FST. A collection of
FST transitions can be encoded as a SUSHI FST Transition Set (SFTS) in the following form:


\[ \mathcal{T} = (q, q', \phi : \varphi) \]


where \(q, q'\) are the source and destination states, the input charset \(\phi = [n_1, n_2]\) with \(0 \le n_1 \le n_2\) represents
a range of input characters, and the output charset \(\varphi = [m_1, m_2]\) with \(0 \le m_1 \le m_2\) represents a range
of output characters. \(\mathcal{T}\) includes a set of transitions with the same source and destination states: \(\mathcal{T} =
\{(q, q', a : b) \mid a \in \phi \text{ and } b \in \varphi\}\). For \(\mathcal{T} = (q, q', \phi : \varphi)\), however, it is required that if \(|\phi| > 1\) and
\(|\varphi| > 1\), then \(\phi = \varphi\). For \(\phi\) and \(\varphi\), \(\epsilon\) is represented using \([-1, -1]\). Thus, there are three types of SFTS
(excluding the \(\epsilon\) cases), as shown in the following. Type I: \(|\phi| > 1\) and \(|\varphi| = 1\), thus \(\mathcal{T} = \{(q, q', a :
b) \mid a \in \phi \text{ and } \varphi = \{b\}\}\). Type II: \(|\phi| = 1\) and \(|\varphi| > 1\), thus \(\mathcal{T} = \{(q, q', a : b) \mid b \in \varphi \text{ and } \phi = \{a\}\}\).
Type III: \(|\varphi| = |\phi| > 1\), thus \(\mathcal{T} = \{(q, q', a : a) \mid a \in \phi\}\). The top-left of Figure 4 gives an intuitive
illustration of these SFTS types (which relates the input and output chars).

The algorithms for supporting FST operations (such as union, Kleene star) should be customized
correspondingly. In the following, we take FST composition as one example. Let \(\mathcal{A} = (\Sigma, Q, q_0, F, \delta)\)
be the composition of \(\mathcal{A}_1 = (\Sigma, Q_1, q_0^1, F_1, \delta_1)\) and \(\mathcal{A}_2 = (\Sigma, Q_2, q_0^2, F_2, \delta_2)\). Given \(\tau_1 = (t_1, t_1', \phi_1 : \varphi_1)\) in
\(\mathcal{A}_1\) and \(\tau_2 = (t_2, t_2', \phi_2 : \varphi_2)\) in \(\mathcal{A}_2\), where \(\varphi_1 \cap \phi_2 \ne \emptyset\), an SFTS \(\tau = (s_1, s_2, \phi : \varphi)\) is defined for \(\mathcal{A}\) s.t.
\(s_1 = (t_1, t_2)\), \(s_2 = (t_1', t_2')\), and the input/output charset of \(\tau\) is defined as in the table in Figure 4 (note all
entries except for (I, II) produce one SFTS only). For example, when both \(\tau_1\) and \(\tau_2\) are type I, we have
\(\phi = \phi_1\) and \(\varphi = \varphi_2\). The bottom left of Figure 4 shows the intuition of the algorithm. The dashed circles
represent the corresponding input/output charset.
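The compact encoding can be sketched as a small data class whose ranges are expanded on demand. This is an illustration only, not the SUSHI implementation: the class `SFTS`, its field names, and the helper `compose_explicit` are hypothetical, characters are represented by integer code points, and only the (I, I) entry of the composition table is checked, since that is the one case stated in the text.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SFTS:
    src: object
    dst: object
    inp: tuple      # inclusive code-point range (n1, n2)
    out: tuple      # inclusive code-point range (m1, m2)

    def transitions(self) -> set:
        """Expand the compact form into explicit (src, dst, a, b) transitions."""
        chars_in = range(self.inp[0], self.inp[1] + 1)
        chars_out = range(self.out[0], self.out[1] + 1)
        if len(chars_in) > 1 and len(chars_out) > 1:
            assert self.inp == self.out          # Type III must be an identity
            return {(self.src, self.dst, a, a) for a in chars_in}
        # Type I, Type II, or a single transition: full pairing of the ranges
        return {(self.src, self.dst, a, b) for a in chars_in for b in chars_out}

def compose_explicit(t1: SFTS, t2: SFTS) -> set:
    """Reference composition on the expanded transitions (pipe t1 into t2)."""
    return {((p, q), (p2, q2), a, c)
            for (p, p2, a, b) in t1.transitions()
            for (q, q2, b2, c) in t2.transitions() if b == b2}

type_one_a = SFTS("p", "p2", (97, 99), (120, 120))     # a..c : x
type_one_b = SFTS("s", "s2", (119, 121), (48, 48))     # w..y : 0
identity = SFTS("p", "p2", (97, 99), (97, 99))         # a..c copied unchanged

# (I, I): input charset of the first, output charset of the second.
compact = SFTS(("p", "s"), ("p2", "s2"), type_one_a.inp, type_one_b.out)
print(len(type_one_a.transitions()), len(identity.transitions()))
print(compose_explicit(type_one_a, type_one_b) == compact.transitions())
```

Working on ranges rather than on expanded transitions is what keeps the representation small for a 16-bit alphabet; the explicit expansion here serves only as a reference for the check.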


5.2 Evaluation 


We are interested in whether the proposed technique is efficient and effective in practice. We list here
four SISE equations for stress-testing the SUSHI package. Note that each equation is parametrized by
an integer \(n\). eq1: \(x^{+}_{a^+ \to b\{n,n\}} = b\{2n,2n\}\); eq2: = b{2n,2n}; eq3: = b{2n,2n}; eq4:
\(x^{-}_{a^* \to b\{n,n\}} = b\{2n,2n\}\). The following table displays the running results when \(n\) is 41 (more data in [4]).
It displays the max size of FST and FSA used in the solution process.


Equation   FST States   FST Transitions   FSA States   FSA Transitions   Time (Seconds)
eq1(41)    5751         16002             125          207               155.281
eq2(41)    5416         5748              83           124               162.469
eq3(41)    631          1565              2            2                 492.281
eq4(41)    126          177               0            0                 14.016
The technique scales well in practice. We applied SUSHI in discovering SQL injection vulnerabilities
and XSS attacks in FLEX SDK (see technical report [4]). The running cost ranges from 1.4 to 74 seconds
on a 2.1 GHz PC with 2 GB RAM (with SISE equation size* ranging from 17 to 565).

6 Related Work 

Recently, string analysis has received much attention in security and compatibility analysis of programs
(see e.g., [5, 12]). In general, there are two interesting directions of string analysis: (1) forward analysis,
which computes the image (or its approximation) of the program states as constraints on strings; and,
(2) backward analysis, which usually starts from the negation of a property and computes backward.
Most of the related work (e.g., [2, 11, 18]) falls into the category of forward analysis. This work can be
used for both forward and backward image computation. Compared with forward analysis, it is able to
generate attack signatures as evidence of vulnerabilities.

Modeling regular replacement distinguishes our work from several related works in the area. For
example, one work close to ours is the HAMPI string constraint solver [9]. HAMPI supports solving string
constraints with context-free components, which are unfolded to regular language. HAMPI, however,
supports neither constant string replacement nor regular replacement, which limits its ability to reason
about sanitation procedures. Similarly, Hooimeijer and Weimer’s work [6] on the decision procedure for
regular constraints does not support regular replacement. A closer work to ours is Yu’s automata based
forward/backward string analysis [18]. Yu uses a language based replacement [17], which introduces
imprecision in its over-approximation. Conversely, our analysis considers the delicate differences among
the typical regular replacement semantics and provides more accurate analysis. In [1], Bjørner et al. use
first order formulas on bit-vectors to model string operations except replacement. We conjecture that it can
be extended by using recursion in their first order framework for defining replaceAll semantics.

FST is the major modeling tool in this paper. It is mainly inspired by [7, 14, 8] in computational
linguistics, for processing phonological and morphological rules. In [8], an informal discussion was
given for the semantics of left-most longest matching of string replacement. This paper has given the
formal definition of replacement semantics and has considered the case where \(\epsilon\) is included in the search
pattern. Compared with [7] where FST is used for processing phonological rules, our approach is lighter
given that we do not need to consider the left and right context of re-writing rules in [7]. Thus more
DFST can be used, which certainly has advantages over NFST, because DFST is less expressive. For
example, in modeling the reluctant semantics, compared with [7], our algorithm does not have to
nondeterministically insert begin markers and it does not need extra filters, and is thus more efficient. It is
interesting to compare the two algorithms and measure the gain in performance by using more DFST in
modeling, which remains part of our future work.

Limitation of the Model: It is shown in [4] that solving SISE constraints is decidable (with worst-case
complexity 2-EXPTIME). This may seem contradictory with the conclusion in [1]. The decidability
is achieved by restricting SISE as described below. SISE requires that each variable appears at most
once and all variables must appear in the LHS. This permits a simple recursive algorithm that reduces
the solution process into a number of backward image computation steps. However, it may limit the
expressiveness of SISE in certain application scenarios. SISE supports regular replacement, substring, and
concatenation operators; however, it does not support operators related to string length, e.g., indexOf and
length operators. It is interesting to extend the framework to support mixed numerical and string
operators, e.g., encoding numeric constraints using automata as described by Yu et al. in [18], or translating
string constraints to first order formulas on bit-vectors as shown by Bjørner et al. [1].

*SISE equation size is measured by the combined length of constant words, variables, and operators included in the 
equation. 
7 Conclusion

This paper presents the finite state transducer models of various regular substitutions, including the 
declarative, finite, reluctant, and greedy replacement. A compact FST representation is implemented 
in a constraint solver SUSHI. The presented technique can be used for analyzing programs that process 
text and communicate with users using strings. Future directions include modeling mixture of greedy 
and reluctant semantics, handling hybrid numeric/string constraints, and context free components. 
Acknowledgment: This paper is inspired by the discussion with Fang Yu, Tevfik Bultan, and Oscar 
Ibarra. We thank the anonymous reviewers for very constructive comments that help improve the paper. 